According to the Federal Reserve, the housing market has tightened considerably: the supply of homes available for sale has fallen to historically low levels, and home price growth accelerated sharply during the pandemic. In this tutorial, we analyze how home values have changed in the 21st century across the United States. We will be measuring changes in the Zillow Home Value Index (ZHVI), which is a seasonally adjusted measure of typical home value and market changes.
The ZHVI is a measure provided by Zillow, Inc. that captures two key aspects of the housing market, both currently and over time: home value and housing market appreciation. For more details about the ZHVI and how it is calculated, visit this link. The ZHVI User Guide may also be helpful.
Using the ZHVI dataset, our goal is to test whether the size of a region affects its ZHVI. To do this, we will walk through the five stages of the data science lifecycle: data collection, data processing, exploratory analysis and visualization, hypothesis testing, and drawing insights.
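As a preview of the kind of hypothesis test we are building toward, here is a minimal sketch of regressing ZHVI on region size with `statsmodels`. The data here is synthetic (generated with a known negative slope), purely to illustrate the pattern; the real test will use the cleaned Zillow data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic toy data: ZHVI declines with SizeRank plus noise
rng = np.random.default_rng(0)
size_rank = np.arange(1, 101)
zhvi = 400_000 - 2_000 * size_rank + rng.normal(0, 20_000, 100)
toy = pd.DataFrame({"SizeRank": size_rank, "ZHVI": zhvi})

# Ordinary least squares: does SizeRank predict ZHVI?
model = smf.ols("ZHVI ~ SizeRank", data=toy).fit()
print(model.params["SizeRank"])   # estimated slope
print(model.pvalues["SizeRank"])  # significance of the slope
```

With the real data, a small p-value on `SizeRank` would suggest that region size is associated with home value.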
This data was obtained from https://www.zillow.com/research/data/ under Home Values.
First, let's import some libraries that we will be using:
If you do not have any of these libraries installed, you can install them by running `pip3 install [package]` in your terminal. If you need more information, you can see the documentation or a tutorial for each library listed below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
import statsmodels.formula.api as smf
raw_data = pd.read_csv("data.csv")
raw_data.head()
| RegionID | SizeRank | RegionName | RegionType | StateName | 2000-01-31 | 2000-02-29 | 2000-03-31 | 2000-04-30 | 2000-05-31 | ... | 2021-02-28 | 2021-03-31 | 2021-04-30 | 2021-05-31 | 2021-06-30 | 2021-07-31 | 2021-08-31 | 2021-09-30 | 2021-10-31 | 2021-11-30 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 102001 | 0 | United States | Country | NaN | 127104.0 | 127448.0 | 127809.0 | 128546.0 | 129288.0 | ... | 274766.0 | 278419.0 | 282276.0 | 286980.0 | 292503.0 | 298382.0 | 303772.0 | 308393.0 | 312486.0 | 316368.0 |
| 1 | 394913 | 1 | New York, NY | Msa | NY | 223875.0 | 225213.0 | 226416.0 | 228785.0 | 230991.0 | ... | 520343.0 | 524700.0 | 529431.0 | 534932.0 | 542083.0 | 549998.0 | 557578.0 | 563469.0 | 568010.0 | 571556.0 |
| 2 | 753899 | 2 | Los Angeles-Long Beach-Anaheim, CA | Msa | CA | 231151.0 | 231956.0 | 233189.0 | 235533.0 | 238046.0 | ... | 748563.0 | 756432.0 | 767475.0 | 783139.0 | 802944.0 | 823842.0 | 837285.0 | 846395.0 | 851153.0 | 858357.0 |
| 3 | 394463 | 3 | Chicago, IL | Msa | IL | 169017.0 | 169416.0 | 169932.0 | 170965.0 | 172060.0 | ... | 258870.0 | 260970.0 | 263993.0 | 266728.0 | 270524.0 | 274664.0 | 278755.0 | 281808.0 | 284452.0 | 287131.0 |
| 4 | 394514 | 4 | Dallas-Fort Worth, TX | Msa | TX | 130276.0 | 130380.0 | 130466.0 | 130678.0 | 130900.0 | ... | 280524.0 | 284541.0 | 290061.0 | 296469.0 | 303787.0 | 311586.0 | 319478.0 | 326661.0 | 332734.0 | 338194.0 |
5 rows × 268 columns
Now, let's clean the data so we can extract only the relevant details: dates, locations, and the ZHVI.
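Before applying `melt` to the full dataset, here is what it does on a hypothetical two-region frame in the same wide layout as `raw_data`: each date column becomes a row, giving one observation per region-date pair.

```python
import pandas as pd

# Hypothetical wide frame: one row per region, one column per month
wide = pd.DataFrame({
    "RegionName": ["New York, NY", "Chicago, IL"],
    "StateName": ["NY", "IL"],
    "2000-01-31": [223875.0, 169017.0],
    "2000-02-29": [225213.0, 169416.0],
})

# Melt the date columns into a long "Date" / value layout
long = wide.melt(id_vars=["RegionName", "StateName"],
                 var_name="Date",
                 value_name="Zillow Home Value Index")
print(long.shape)  # 4 rows: 2 regions x 2 dates
```

The same call below, with the full set of id columns, reshapes all 263 date columns at once.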
# Melting the dates and converting dates to datetime
data = raw_data.melt(id_vars=["RegionName", "StateName", "RegionID", "SizeRank", "RegionType"],
var_name = "Date",
value_name ="Zillow Home Value Index")
data["Date"] = pd.to_datetime(data["Date"], format="%Y-%m-%d")
# Dropping unnecessary columns
data.drop(["RegionID", "RegionType"], axis = 1, inplace = True)
data.dropna(subset = ["StateName"], inplace = True)
data.head()
| RegionName | StateName | SizeRank | Date | Zillow Home Value Index | |
|---|---|---|---|---|---|
| 1 | New York, NY | NY | 1 | 2000-01-31 | 223875.0 |
| 2 | Los Angeles-Long Beach-Anaheim, CA | CA | 2 | 2000-01-31 | 231151.0 |
| 3 | Chicago, IL | IL | 3 | 2000-01-31 | 169017.0 |
| 4 | Dallas-Fort Worth, TX | TX | 4 | 2000-01-31 | 130276.0 |
| 5 | Philadelphia, PA | PA | 5 | 2000-01-31 | 129615.0 |
The data now looks clean but there could be some missing values. Let's check!
data["Zillow Home Value Index"].isna().sum()
48727
In order to impute the missing values, we are going to calculate the average ZHVI for the whole state on that date.
# in order to impute the n/a values, calculate the avg index for that state on that date
avg_state_date = data.groupby(["StateName", "Date"]).mean(numeric_only=True)
avg_state_date.reset_index(level = ["StateName", "Date"], inplace = True)
avg_state_date.rename(columns ={"Zillow Home Value Index": "Average Zillow Home Value Index for State on Date"}, inplace = True)
avg_state_date.drop(["SizeRank"], axis = 1, inplace = True)
# Merging back to original dataframe and filling the holes
data = pd.merge(data, avg_state_date, how = "left", on = ["StateName", "Date"])
data["Zillow Home Value Index"] = data["Zillow Home Value Index"].fillna(data["Average Zillow Home Value Index for State on Date"])
data["Zillow Home Value Index"].isna().sum()
3468
There are still some missing values. Let's fill those with the average ZHVI for the whole region across our window of time.
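Why do some values remain missing? When every region in a state is missing on a given date, the state-date group's mean is itself NaN, so the first round of imputation has nothing to fill with. A toy illustration (hypothetical state codes):

```python
import numpy as np
import pandas as pd

# "AK" has no observed values at all, "NY" has one
demo = pd.DataFrame({
    "StateName": ["AK", "AK", "NY", "NY"],
    "ZHVI": [np.nan, np.nan, 200000.0, np.nan],
})

# groupby().mean() skips NaNs, but an all-NaN group yields NaN
state_avg = demo.groupby("StateName")["ZHVI"].mean()
print(state_avg)
```

Falling back to the region's average across all dates, as below, covers these cases.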
avg_region = data.groupby(["RegionName"]).mean(numeric_only=True)
avg_region.drop(["Average Zillow Home Value Index for State on Date", "SizeRank"], axis =1, inplace = True)
avg_region.reset_index(level = ["RegionName"], inplace = True)
avg_region.rename(columns ={"Zillow Home Value Index": "Average Zillow Home Value Index for Region across Time"}, inplace = True)
data = pd.merge(data, avg_region, how = "left", on = ["RegionName"])
data["Zillow Home Value Index"] = data["Zillow Home Value Index"].fillna(data["Average Zillow Home Value Index for Region across Time"])
data["Zillow Home Value Index"].isna().sum()
0
Let's calculate the average ZHVI for each state across time as well.
avg_state = data.groupby(["StateName"]).mean(numeric_only=True)
avg_state.drop(["Average Zillow Home Value Index for State on Date", "SizeRank", "Average Zillow Home Value Index for Region across Time"], axis = 1, inplace = True)
avg_state.reset_index(level = ["StateName"], inplace = True)
avg_state.rename(columns ={"Zillow Home Value Index": "Average Zillow Home Value Index for State across Time"}, inplace = True)
data = pd.merge(data, avg_state, how = "left", on = ["StateName"])
data.head()
| RegionName | StateName | SizeRank | Date | Zillow Home Value Index | Average Zillow Home Value Index for State on Date | Average Zillow Home Value Index for Region across Time | Average Zillow Home Value Index for State across Time | |
|---|---|---|---|---|---|---|---|---|
| 0 | New York, NY | NY | 1 | 2000-01-31 | 223875.0 | 79925.461538 | 407514.752852 | 131142.310067 |
| 1 | Los Angeles-Long Beach-Anaheim, CA | CA | 2 | 2000-01-31 | 231151.0 | 221478.900000 | 505735.870722 | 358798.080588 |
| 2 | Chicago, IL | IL | 3 | 2000-01-31 | 169017.0 | 90254.285714 | 222440.041825 | 107777.913740 |
| 3 | Dallas-Fort Worth, TX | TX | 4 | 2000-01-31 | 130276.0 | 103721.555556 | 177825.041825 | 121059.292274 |
| 4 | Philadelphia, PA | PA | 5 | 2000-01-31 | 129615.0 | 92338.666667 | 216507.528517 | 133913.688633 |
Finally, the data is ready to be dissected. Let's move on to analysis.
Here is how the ZHVI has changed for every state over time:
n = data.StateName.nunique()
distinct_colors = plt.cm.gist_rainbow(np.linspace(0, 1, n))
fig, ax = plt.subplots(figsize = (150, 100))
states = data.groupby(["StateName"])
i = 0
for name, group in states:
#group.plot(x = "Date", y = "Zillow Home Value Index", ax = ax, label = name, color = distinct_colors[i], linewidth = 1)
ax.scatter(x = "Date",y = "Zillow Home Value Index", data = group, label = name, s = 500, color = distinct_colors[i])
i = i+1
ax.legend(fontsize = 60)
plt.title("Zillow Home Value Index per State over Time", fontsize = 150)
plt.xlabel("Date", fontsize = 100)
plt.xticks(fontsize = 40)
plt.ylabel("ZHVI (USD)", fontsize = 100)
plt.yticks(fontsize = 40);